This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
The analysis in this notebook is structured to examine univariate, bivariate, and multivariate relationships. It aims to answer questions such as: How many homeowners borrow in each state? Which year has the highest number of borrowing homeowners? Which state has the highest number of borrowing homeowners? Is there a correlation between available bank credit and borrowing capacity? What are the frequencies of the categorical variables (loan term, borrower's employment status, year of loan, and loan status)? Are there differences between loans depending on the loan term? Each loan record in the dataset has 81 features/columns, and the feature descriptions are detailed below:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.color_palette("colorblind")
%matplotlib inline
# loading the dataset...
df = pd.read_csv('prosperLoanData.csv')
#...and calling the head() method to see the first five (5) rows
df.head(5)
| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
#alright, let us see our data's column structure
df.columns
Index(['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade',
'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate',
'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss',
'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)',
'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState',
'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration',
'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey',
'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',
'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines',
'TotalCreditLinespast7years', 'OpenRevolvingAccounts',
'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries',
'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years',
'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
'RevolvingCreditBalance', 'BankcardUtilization',
'AvailableBankcardCredit', 'TotalTrades',
'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months',
'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable',
'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans',
'TotalProsperPaymentsBilled', 'OnTimeProsperPayments',
'ProsperPaymentsLessThanOneMonthLate',
'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed',
'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing',
'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber',
'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount',
'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey',
'MonthlyLoanPayment', 'LP_CustomerPayments',
'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees',
'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss',
'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations',
'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
'Investors'],
dtype='object')
# Phew, that is a lot. Let us explore our data further by checking
# its structure, in other words the shape of our dataset
df.shape
(113937, 81)
We have 113,937 rows and 81 columns. That is a lot, but we will not need all of them for our questions. Before narrowing down, let us explore the data further and check for any errors or corrections we might need to make.
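Since we will not need all 81 columns, one way to narrow the frame later is to select just the columns our questions use. A minimal sketch of the pattern on a tiny stand-in frame; the column list here is illustrative, not the final selection:

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the full 81-column df.
df = pd.DataFrame({
    'Term': [36, 60],
    'LoanStatus': ['Current', 'Completed'],
    'BorrowerAPR': [0.16516, 0.12016],
})

# Keep only the columns our questions need (illustrative list).
columns_of_interest = ['Term', 'LoanStatus']
subset = df[columns_of_interest].copy()  # .copy() keeps later edits off df
print(subset.shape)  # (2, 2)
```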
#checking our dataset info,
# data type and our column count to see if all entries are accurate
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype
---  ------                               --------------   -----
 0   ListingKey                           113937 non-null  object
 1   ListingNumber                        113937 non-null  int64
 2   ListingCreationDate                  113937 non-null  object
 3   CreditGrade                          28953 non-null   object
 4   Term                                 113937 non-null  int64
 5   LoanStatus                           113937 non-null  object
 6   ClosedDate                           55089 non-null   object
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64
 17  BorrowerState                        108422 non-null  object
 18  Occupation                           110349 non-null  object
 19  EmploymentStatus                     111682 non-null  object
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool
 22  CurrentlyInGroup                     113937 non-null  bool
 23  GroupKey                             13341 non-null   object
 24  DateCreditPulled                     113937 non-null  object
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object
 48  IncomeVerifiable                     113937 non-null  bool
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64
 62  LoanNumber                           113937 non-null  int64
 63  LoanOriginalAmount                   113937 non-null  int64
 64  LoanOriginationDate                  113937 non-null  object
 65  LoanOriginationQuarter               113937 non-null  object
 66  MemberKey                            113937 non-null  object
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64
 78  InvestmentFromFriendsCount           113937 non-null  int64
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
#Let us check for duplicated values
df[df.duplicated()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 81 columns
# No duplicates! Good. Now let's check for null values in one of our columns
df[df.LoanStatus.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 81 columns
We checked for null values in the column to confirm that every record has a loan status entered.
df.Term.value_counts()
36    87778
60    24545
12     1614
Name: Term, dtype: int64
# the term is stored as a number of months rather than a category;
# we need to change this
df.Term.dtype
dtype('int64')
#Convert to a categorical data type with each category named
df.Term = df.Term.map({36: 'Medium Term', 60:'Long Term', 12:'Short Term'})
df.Term.value_counts()
Medium Term    87778
Long Term      24545
Short Term      1614
Name: Term, dtype: int64
#Good, now let's convert to a categorical data type, just to be sure
df.Term = df.Term.astype('category')
df.Term.dtype
CategoricalDtype(categories=['Long Term', 'Medium Term', 'Short Term'], ordered=False)
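Since the three terms have a natural order (Short < Medium < Long), an ordered categorical could also encode that ranking; a small sketch, assuming the same labels used above:

```python
import pandas as pd

term = pd.Series(['Medium Term', 'Long Term', 'Short Term', 'Medium Term'])

# ordered=True tells pandas that Short < Medium < Long, so sorting and
# comparisons follow loan length instead of alphabetical order.
term_dtype = pd.CategoricalDtype(
    categories=['Short Term', 'Medium Term', 'Long Term'], ordered=True)
term = term.astype(term_dtype)

print(term.sort_values().tolist())
# ['Short Term', 'Medium Term', 'Medium Term', 'Long Term']
```

With an ordered dtype, plots and group-bys also keep the Short/Medium/Long order without manual re-sorting.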
df.LoanStatus.value_counts()
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64
df.LoanStatus = df.LoanStatus.astype('category')
df.LoanStatus.dtype
CategoricalDtype(categories=['Cancelled', 'Chargedoff', 'Completed', 'Current',
'Defaulted', 'FinalPaymentInProgress',
'Past Due (1-15 days)', 'Past Due (16-30 days)',
'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)'],
ordered=False)
# Since all the day-bracketed statuses are past due, let us collapse
# them into a single 'Past Due' status
df.LoanStatus = df.LoanStatus.replace({'Past Due (1-15 days)':'Past Due', 'Past Due (16-30 days)':'Past Due',
'Past Due (31-60 days)':'Past Due', 'Past Due (61-90 days)' :'Past Due',
'Past Due (91-120 days)' :'Past Due', 'Past Due (>120 days)' :'Past Due'})
df.LoanStatus.value_counts()
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due                   2067
FinalPaymentInProgress      205
Cancelled                     5
Name: LoanStatus, dtype: int64
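As an aside, the same collapse can be done without listing every bracketed label: each variant starts with the literal prefix 'Past Due', so a string-prefix test buckets them all at once. A sketch on a toy Series:

```python
import pandas as pd

status = pd.Series(['Current', 'Past Due (1-15 days)',
                    'Past Due (>120 days)', 'Completed'])

# Keep labels that do not start with 'Past Due'; replace the rest.
collapsed = status.where(~status.str.startswith('Past Due'), 'Past Due')
print(collapsed.tolist())
# ['Current', 'Past Due', 'Past Due', 'Completed']
```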
# Good, Let us proceed
# Let us check Home Owners Column
df.IsBorrowerHomeowner.value_counts()
True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64
# date loan was originated
df.LoanOriginationDate.value_counts()
2014-01-22 00:00:00 491
2013-11-13 00:00:00 490
2014-02-19 00:00:00 439
2013-10-16 00:00:00 434
2014-01-28 00:00:00 339
...
2006-02-03 00:00:00 1
2006-01-24 00:00:00 1
2005-11-18 00:00:00 1
2009-07-20 00:00:00 1
2005-11-15 00:00:00 1
Name: LoanOriginationDate, Length: 1873, dtype: int64
# convert the loan origination date to a datetime type so we can
# extract the year, to assist our analysis
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.LoanOriginationDate.dtype
dtype('<M8[ns]')
#Extract the year; we are going to perform our analysis by year
df['Year'] = pd.DatetimeIndex(df['LoanOriginationDate']).year
df.Year
0 2007
1 2014
2 2007
3 2012
4 2013
...
113932 2013
113933 2011
113934 2013
113935 2011
113936 2014
Name: Year, Length: 113937, dtype: int64
# check data type
df.Year.dtype
dtype('int64')
# convert year to categorical variable
df.Year = df.Year.astype('category')
df.Year.dtype
CategoricalDtype(categories=[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014], ordered=False)
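For reference, the .dt accessor reaches the same year values without building a DatetimeIndex by hand; a small sketch on a toy Series:

```python
import pandas as pd

dates = pd.Series(['2007-08-26', '2014-02-27', '2012-10-22'])

# .dt.year extracts the year component from a datetime Series.
year = pd.to_datetime(dates).dt.year
print(year.tolist())  # [2007, 2014, 2012]
```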
Let us check some other columns, just to make sure we are on the right track.
# value count of verifiable income
df.IncomeVerifiable.value_counts()
True     105268
False      8669
Name: IncomeVerifiable, dtype: int64
# number of recommendations values
df.Recommendations.value_counts()
0     109678
1       3516
2        568
3        108
4         26
5         14
9          6
7          5
6          4
8          3
18         2
16         2
14         1
21         1
24         1
19         1
39         1
Name: Recommendations, dtype: int64
# employment status values
df.EmploymentStatus.value_counts()
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
# sum of missing values
df.EmploymentStatus.isnull().sum()
2255
# EmploymentStatus missing values records
df[df.EmploymentStatus.isnull()]
| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 109D3366077649289619396 | 24135 | 2006-07-11 17:14:14.647000000 | B | Medium Term | Completed | 2009-10-22 00:00:00 | 0.15211 | 0.1450 | 0.1400 | ... | -299.78 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 44 | 2006 |
| 92 | 0FE433661215082408720F0 | 22591 | 2006-07-02 11:05:02.857000000 | AA | Medium Term | Completed | 2007-03-01 00:00:00 | 0.08292 | 0.0755 | 0.0705 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 21 | 2006 |
| 146 | 34B03365511229087456DB5 | 31745 | 2006-08-16 09:43:10.867000000 | A | Medium Term | Completed | 2006-12-04 00:00:00 | 0.09939 | 0.0925 | 0.0850 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 67 | 2006 |
| 162 | 10463364871408702EBC041 | 17578 | 2006-06-03 07:46:36.343000000 | AA | Medium Term | Completed | 2007-01-09 00:00:00 | 0.08483 | 0.0780 | 0.0730 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 93 | 2006 |
| 193 | 0F6133644711535583E92C3 | 17091 | 2006-05-31 23:48:17.177000000 | E | Medium Term | Defaulted | 2007-10-08 00:00:00 | 0.23937 | 0.2300 | 0.2250 | ... | 0.00 | 1590.40 | 1584.56 | 0.0 | 1.0 | 0 | 0 | 0.0 | 13 | 2006 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113663 | DBD7336509210597772988B | 21182 | 2006-06-22 12:04:17.483000000 | HR | Medium Term | Chargedoff | 2008-12-06 00:00:00 | 0.29525 | 0.2875 | 0.2450 | ... | 0.00 | 1081.21 | 1081.22 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 | 2006 |
| 113712 | DC1033648016334634330F6 | 18121 | 2006-06-06 15:08:32.807000000 | HR | Medium Term | Defaulted | 2007-05-16 00:00:00 | 0.24502 | 0.2375 | 0.2325 | ... | -93.79 | 5315.12 | 5340.58 | 0.0 | 1.0 | 0 | 0 | 0.0 | 51 | 2006 |
| 113902 | E6D13366465509973F5E90A | 3579 | 2006-03-26 00:11:04.620000000 | A | Medium Term | Completed | 2006-05-24 00:00:00 | 0.08805 | 0.0812 | 0.0750 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 143 | 2006 |
| 113914 | D20533652658825353F6597 | 30223 | 2006-08-09 14:34:40.010000000 | HR | Medium Term | Defaulted | 2007-03-19 00:00:00 | 0.25757 | 0.2500 | 0.2450 | ... | 0.00 | 2883.01 | 2883.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2006 |
| 113921 | D21A33647075665665BE266 | 4 | 2005-11-09 20:44:28.847000000 | AA | Medium Term | Completed | 2005-11-25 00:00:00 | NaN | 0.0400 | 0.0350 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2005 |
2255 rows × 82 columns
# Percentage (%) of missing values
df.EmploymentStatus.isnull().mean() * 100
1.9791639239228696
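The same per-column check generalizes: calling isnull().mean() on the whole frame gives the missing percentage for every column at once. A sketch on a toy two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    'EmploymentStatus': ['Employed', None, 'Retired', 'Employed'],
    'Occupation': [None, None, 'Teacher', 'Nurse'],
})

# Share of nulls per column, expressed as a percentage.
missing_pct = df.isnull().mean() * 100
print(missing_pct.to_dict())
# {'EmploymentStatus': 25.0, 'Occupation': 50.0}
```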
We can see that the Employment Status column should be categorical. It has 2255 missing values, representing 1.98% of the column. Replacing the missing values with the most frequent employment status (which keeps the distribution close to the original) and changing the column to a categorical data type will suffice.
# replace missing values with the most frequent value
df.EmploymentStatus.fillna('Employed', inplace = True)
#and change data type to categorical
df.EmploymentStatus = df.EmploymentStatus.astype('category')
# check data type
df.EmploymentStatus.dtype
CategoricalDtype(categories=['Employed', 'Full-time', 'Not available', 'Not employed',
'Other', 'Part-time', 'Retired', 'Self-employed'],
ordered=False)
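Rather than hard-coding 'Employed', the most frequent label can also be looked up with mode(), so the fill stays correct even if the data changes; a sketch on a toy Series:

```python
import pandas as pd

emp = pd.Series(['Employed', 'Full-time', None, 'Employed', None])

# mode()[0] is the most frequent value; use it as the fill value.
filled = emp.fillna(emp.mode()[0]).astype('category')
print(filled.tolist())
# ['Employed', 'Full-time', 'Employed', 'Employed', 'Employed']
```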
#Good, let us continue our data exploration
# stated monthly income
df.StatedMonthlyIncome
0 3083.333333
1 6125.000000
2 2083.333333
3 2875.000000
4 9583.333333
...
113932 4333.333333
113933 8041.666667
113934 2875.000000
113935 3875.000000
113936 4583.333333
Name: StatedMonthlyIncome, Length: 113937, dtype: float64
# number of days of loan delinquency values
df.LoanCurrentDaysDelinquent.value_counts()
0 94860
121 293
10 192
4 77
11 62
...
582 1
1099 1
2555 1
787 1
1081 1
Name: LoanCurrentDaysDelinquent, Length: 2411, dtype: int64
# sorted from highest days of delinquency to lowest
df.LoanCurrentDaysDelinquent.sort_values(ascending = False, ignore_index= True)
0 2704
1 2703
2 2613
3 2612
4 2599
...
113932 0
113933 0
113934 0
113935 0
113936 0
Name: LoanCurrentDaysDelinquent, Length: 113937, dtype: int64
# stated monthly income
df.StatedMonthlyIncome
0 3083.333333
1 6125.000000
2 2083.333333
3 2875.000000
4 9583.333333
...
113932 4333.333333
113933 8041.666667
113934 2875.000000
113935 3875.000000
113936 4583.333333
Name: StatedMonthlyIncome, Length: 113937, dtype: float64
Let us convert the stated monthly income to whole numbers so we have round figures. Note that astype(int) truncates the decimal part rather than rounding.
# convert to whole numbers (truncates the decimal part)
df.StatedMonthlyIncome = df.StatedMonthlyIncome.astype(int)
df.StatedMonthlyIncome
0 3083
1 6125
2 2083
3 2875
4 9583
...
113932 4333
113933 8041
113934 2875
113935 3875
113936 4583
Name: StatedMonthlyIncome, Length: 113937, dtype: int32
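As noted, astype(int) truncates toward zero; when the nearest whole number is wanted, round() first. The difference on a toy Series:

```python
import pandas as pd

income = pd.Series([3083.333333, 2874.6, 9583.2])

truncated = income.astype(int)        # drops the fraction
rounded = income.round().astype(int)  # nearest whole number

print(truncated.tolist())  # [3083, 2874, 9583]
print(rounded.tolist())    # [3083, 2875, 9583]
```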
#Let us sort from highest to lowest
df.StatedMonthlyIncome.sort_values(ascending = False, ignore_index= True)
0 1750002
1 618547
2 483333
3 466666
4 416666
...
113932 0
113933 0
113934 0
113935 0
113936 0
Name: StatedMonthlyIncome, Length: 113937, dtype: int32
#Let us sort from highest to lowest our LoanOriginalAmount
df.LoanOriginalAmount.sort_values(ascending = False, ignore_index= True)
0 35000
1 35000
2 35000
3 35000
4 35000
...
113932 1000
113933 1000
113934 1000
113935 1000
113936 1000
Name: LoanOriginalAmount, Length: 113937, dtype: int64
FURTHER EXPLORATIONS
# monthly loan payment data type
df.MonthlyLoanPayment.dtype
dtype('float64')
# convert from float to integer (truncating the decimals)
df.MonthlyLoanPayment = df.MonthlyLoanPayment.astype(int)
df.MonthlyLoanPayment.dtype
dtype('int32')
# sort from highest monthly loan payment to lowest
df.MonthlyLoanPayment.sort_values(ascending = False, ignore_index= True)
0 2251
1 2218
2 2179
3 2163
4 2153
...
113932 0
113933 0
113934 0
113935 0
113936 0
Name: MonthlyLoanPayment, Length: 113937, dtype: int32
# sum of missing values in occupation column
df.Occupation.isnull().sum()
3588
# let us see what the records are
df[df.Occupation.isnull()]
| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 0F6C3603015887476F3F015 | 1180690 | 2014-02-24 12:55:20.430000000 | NaN | Medium Term | Current | NaN | 0.22966 | 0.1920 | 0.1820 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2014 |
| 39 | 109D3366077649289619396 | 24135 | 2006-07-11 17:14:14.647000000 | B | Medium Term | Completed | 2009-10-22 00:00:00 | 0.15211 | 0.1450 | 0.1400 | ... | -299.78 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 44 | 2006 |
| 92 | 0FE433661215082408720F0 | 22591 | 2006-07-02 11:05:02.857000000 | AA | Medium Term | Completed | 2007-03-01 00:00:00 | 0.08292 | 0.0755 | 0.0705 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 21 | 2006 |
| 146 | 34B03365511229087456DB5 | 31745 | 2006-08-16 09:43:10.867000000 | A | Medium Term | Completed | 2006-12-04 00:00:00 | 0.09939 | 0.0925 | 0.0850 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 67 | 2006 |
| 161 | 10453596739685417502B89 | 1086023 | 2013-12-20 14:57:44.580000000 | NaN | Medium Term | Current | NaN | 0.17151 | 0.1355 | 0.1255 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 2 | 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113709 | DC043594246199576475BB4 | 1017315 | 2013-11-05 05:44:05.173000000 | NaN | Medium Term | Current | NaN | 0.33215 | 0.2925 | 0.2825 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2013 |
| 113712 | DC1033648016334634330F6 | 18121 | 2006-06-06 15:08:32.807000000 | HR | Medium Term | Defaulted | 2007-05-16 00:00:00 | 0.24502 | 0.2375 | 0.2325 | ... | -93.79 | 5315.12 | 5340.58 | 0.0 | 1.0 | 0 | 0 | 0.0 | 51 | 2006 |
| 113902 | E6D13366465509973F5E90A | 3579 | 2006-03-26 00:11:04.620000000 | A | Medium Term | Completed | 2006-05-24 00:00:00 | 0.08805 | 0.0812 | 0.0750 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 143 | 2006 |
| 113914 | D20533652658825353F6597 | 30223 | 2006-08-09 14:34:40.010000000 | HR | Medium Term | Defaulted | 2007-03-19 00:00:00 | 0.25757 | 0.2500 | 0.2450 | ... | 0.00 | 2883.01 | 2883.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2006 |
| 113921 | D21A33647075665665BE266 | 4 | 2005-11-09 20:44:28.847000000 | AA | Medium Term | Completed | 2005-11-25 00:00:00 | NaN | 0.0400 | 0.0350 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 | 2005 |
3588 rows × 82 columns
# Percentage (%) of missing values in occupation
df.Occupation.isnull().mean()* 100
3.149108717975723
In our Occupation column, 3588 values are null, amounting to 3.15% of the column. Hence, we will fill in the null values with an identifiable placeholder: "Unknown".
We also need to convert the Occupation column to a categorical data type.
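The two steps just described (fill the nulls with "Unknown", then convert to a categorical type) can be sketched on a toy Series:

```python
import pandas as pd

occ = pd.Series(['Other', None, 'Teacher', None])

# Fill gaps with an explicit 'Unknown' label, then make the column
# categorical so counts and plots treat it as a discrete variable.
occ = occ.fillna('Unknown').astype('category')
print(occ.isnull().sum())          # 0
print(sorted(occ.cat.categories))  # ['Other', 'Teacher', 'Unknown']
```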
# value count of occupation
df.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
# list of all occupations recorded in our dataset
list(df.Occupation)
['Other', 'Professional', 'Other', 'Skilled Labor', 'Executive', 'Professional', 'Sales - Retail', 'Laborer', 'Food Service', 'Food Service', 'Fireman', 'Waiter/Waitress', 'Sales - Retail', 'Construction', 'Computer Programmer', 'Other', 'Professional', 'Professional', 'Sales - Commission', 'Laborer', ...]
'Investor', 'Professional', 'Other', 'Other', 'Computer Programmer', 'Laborer', 'Laborer', 'Teacher', 'Clerical', 'Psychologist', 'Professional', 'Other', 'Retail Management', 'Professional', 'Truck Driver', 'Homemaker', 'Retail Management', 'Food Service Management', 'Other', 'Professional', 'Sales - Retail', 'Principal', 'Other', 'Sales - Retail', 'Computer Programmer', 'Realtor', 'Administrative Assistant', 'Other', 'Sales - Retail', 'Sales - Commission', 'Retail Management', 'Medical Technician', 'Accountant/CPA', 'Computer Programmer', 'Teacher', 'Tradesman - Mechanic', 'Social Worker', 'Retail Management', 'Engineer - Mechanical', 'Engineer - Mechanical', 'Other', 'Retail Management', nan, 'Police Officer/Correction Officer', 'Professional', 'Other', 'Skilled Labor', 'Other', 'Skilled Labor', 'Computer Programmer', 'Fireman', 'Clerical', 'Professional', 'Computer Programmer', 'Civil Service', 'Other', 'Other', 'Professional', 'Computer Programmer', 'Biologist', 'Truck Driver', 'Executive', 'Medical Technician', 'Accountant/CPA', 'Postal Service', 'Other', 'Other', ...]
#A lot! But no problem -- what is the data type?
df.Occupation.dtype
dtype('O')
I ran into an error here: when I converted the data type to category first, filling in the null values refused to work. So I converted back to object, filled the NaN values with "Unknown", and only then converted the data type to category.
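To see why the order matters, here is a minimal, hypothetical sketch: filling a categorical Series with a label that is not already one of its categories raises an error, so we fill while the column is still object dtype and convert afterwards.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Analyst', np.nan, 'Teacher'], dtype='category')
try:
    # 'Unknown' is not an existing category, so this raises
    s.fillna('Unknown')
except (TypeError, ValueError) as err:
    print('fillna on category failed:', type(err).__name__)

# safe order: fill on the object-dtype column first, then convert
s2 = pd.Series(['Analyst', np.nan, 'Teacher'], dtype='object')
s2 = s2.fillna('Unknown').astype('category')
print('Unknown' in s2.cat.categories)  # True
```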
# keep the column as object for now (fill nulls before converting to category)
df.Occupation = df.Occupation.astype('object')
df.Occupation.dtype
dtype('O')
# replace null with unknown
df.Occupation.fillna('Unknown', inplace = True)
# check for null records
df[df.Occupation.isnull()]
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 82 columns
# Good, let us now convert to categorical data type
df.Occupation = df.Occupation.astype('category')
df.Occupation.dtype
CategoricalDtype(categories=['Accountant/CPA', 'Administrative Assistant', 'Analyst',
'Architect', 'Attorney', 'Biologist', 'Bus Driver',
'Car Dealer', 'Chemist', 'Civil Service', 'Clergy',
'Clerical', 'Computer Programmer', 'Construction', 'Dentist',
'Doctor', 'Engineer - Chemical', 'Engineer - Electrical',
'Engineer - Mechanical', 'Executive', 'Fireman',
'Flight Attendant', 'Food Service',
'Food Service Management', 'Homemaker', 'Investor', 'Judge',
'Laborer', 'Landscaping', 'Medical Technician',
'Military Enlisted', 'Military Officer', 'Nurse (LPN)',
'Nurse (RN)', "Nurse's Aide", 'Other', 'Pharmacist',
'Pilot - Private/Commercial',
'Police Officer/Correction Officer', 'Postal Service',
'Principal', 'Professional', 'Professor', 'Psychologist',
'Realtor', 'Religious', 'Retail Management',
'Sales - Commission', 'Sales - Retail', 'Scientist',
'Skilled Labor', 'Social Worker',
'Student - College Freshman',
'Student - College Graduate Student',
'Student - College Junior', 'Student - College Senior',
'Student - College Sophomore', 'Student - Community College',
'Student - Technical School', 'Teacher', "Teacher's Aide",
'Tradesman - Carpenter', 'Tradesman - Electrician',
'Tradesman - Mechanic', 'Tradesman - Plumber',
'Truck Driver', 'Unknown', 'Waiter/Waitress'],
ordered=False)
# value counts of the number of investors per loan
df.Investors.value_counts()
1 27814
2 1386
3 991
4 827
5 753
...
665 1
634 1
555 1
752 1
754 1
Name: Investors, Length: 751, dtype: int64
# borrower state value count
df.BorrowerState.value_counts()
CA    14717
TX     6842
NY     6729
FL     6720
IL     5921
GA     5008
OH     4197
MI     3593
VA     3278
NJ     3097
NC     3084
WA     3048
PA     2972
MD     2821
MO     2615
MN     2318
MA     2242
CO     2210
IN     2078
AZ     1901
WI     1842
OR     1817
TN     1737
AL     1679
CT     1627
SC     1122
NV     1090
KS     1062
KY      983
OK      971
LA      954
UT      877
AR      855
MS      787
NE      674
ID      599
NH      551
NM      472
RI      435
HI      409
WV      391
DC      382
MT      330
DE      300
VT      207
AK      200
SD      189
IA      186
WY      150
ME      101
ND       52
Name: BorrowerState, dtype: int64
# borrower state sum of null values
df.BorrowerState.isnull().sum()
5515
# check data type
df.BorrowerState.dtype
dtype('O')
# borrower state % of null values
df.BorrowerState.isnull().mean() * 100
4.840394252964358
In our BorrowerState column, 5515 values are null, amounting to 4.84% of the records. Hence, we will fill the null values with an identifiable keyword: "Unidentified".
We also need to convert the BorrowerState column to a categorical data type.
# fill missing state with UNIDENTIFIED
df.BorrowerState.fillna('Unidentified', inplace = True)
#Let us check for null values again
df.BorrowerState.isnull().sum()
0
# Good, let's convert the data type to categorical
df.BorrowerState = df.BorrowerState.astype('category')
# check data type
df.BorrowerState.dtype
CategoricalDtype(categories=['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA',
'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE',
'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI',
'SC', 'SD', 'TN', 'TX', 'UT', 'Unidentified', 'VA', 'VT',
'WA', 'WI', 'WV', 'WY'],
ordered=False)
#Let us replace the abbreviated states with the full state names
state = {'WA': 'WASHINGTON', 'VA': 'VIRGINIA', 'DE': 'DELAWARE',
'DC': 'DISTRICT OF COLUMBIA', 'WI': 'WISCONSIN', 'WV': 'WEST VIRGINIA',
'HI': 'HAWAII', 'FL': 'FLORIDA', 'WY': 'WYOMING',
'NH': 'NEW HAMPSHIRE', 'NJ': 'NEW JERSEY', 'NM': 'NEW MEXICO',
'TX': 'TEXAS', 'LA': 'LOUISIANA', 'NC': 'NORTH CAROLINA',
'ND': 'NORTH DAKOTA', 'NE': 'NEBRASKA', 'TN': 'TENNESSEE',
'NY': 'NEW YORK', 'PA': 'PENNSYLVANIA', 'CA': 'CALIFORNIA',
'NV': 'NEVADA', 'CO': 'COLORADO', 'VI': 'VIRGIN ISLANDS',
'AK': 'ALASKA', 'AL': 'ALABAMA', 'AR': 'ARKANSAS',
'VT': 'VERMONT', 'IL': 'ILLINOIS', 'GA': 'GEORGIA',
'IN': 'INDIANA', 'IA': 'IOWA', 'OK': 'OKLAHOMA',
'AZ': 'ARIZONA', 'ID': 'IDAHO', 'CT': 'CONNECTICUT',
'ME': 'MAINE', 'MD': 'MARYLAND', 'MA': 'MASSACHUSETTS',
'OH': 'OHIO', 'UT': 'UTAH', 'MO': 'MISSOURI',
'MN': 'MINNESOTA', 'MI': 'MICHIGAN', 'RI': 'RHODE ISLAND',
'KS': 'KANSAS', 'MT': 'MONTANA', 'MS': 'MISSISSIPPI',
'SC': 'SOUTH CAROLINA', 'KY': 'KENTUCKY', 'OR': 'OREGON',
'SD': 'SOUTH DAKOTA', 'Unknown': 'UNKNOWN' }
# replace the abbreviations using the dictionary created
# (note: 'Unidentified' is not a key in the dictionary, so map() turns
# those rows back into NaN; they resurface as nulls in new_df below)
df.BorrowerState = df.BorrowerState.map(state)
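One caveat with `Series.map` is worth illustrating: any value without a matching dictionary key becomes NaN. A tiny sketch with made-up values:

```python
import pandas as pd

s = pd.Series(['CA', 'TX', 'Unidentified'])
lookup = {'CA': 'CALIFORNIA', 'TX': 'TEXAS'}  # no 'Unidentified' key
mapped = s.map(lookup)
print(mapped.isnull().sum())  # 1 -- 'Unidentified' had no key and became NaN
```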
#Check the Values
df.BorrowerState.value_counts()
CALIFORNIA              14717
TEXAS                    6842
NEW YORK                 6729
FLORIDA                  6720
ILLINOIS                 5921
GEORGIA                  5008
OHIO                     4197
MICHIGAN                 3593
VIRGINIA                 3278
NEW JERSEY               3097
NORTH CAROLINA           3084
WASHINGTON               3048
PENNSYLVANIA             2972
MARYLAND                 2821
MISSOURI                 2615
MINNESOTA                2318
MASSACHUSETTS            2242
COLORADO                 2210
INDIANA                  2078
ARIZONA                  1901
WISCONSIN                1842
OREGON                   1817
TENNESSEE                1737
ALABAMA                  1679
CONNECTICUT              1627
SOUTH CAROLINA           1122
NEVADA                   1090
KANSAS                   1062
KENTUCKY                  983
OKLAHOMA                  971
LOUISIANA                 954
UTAH                      877
ARKANSAS                  855
MISSISSIPPI               787
NEBRASKA                  674
IDAHO                     599
NEW HAMPSHIRE             551
NEW MEXICO                472
RHODE ISLAND              435
HAWAII                    409
WEST VIRGINIA             391
DISTRICT OF COLUMBIA      382
MONTANA                   330
DELAWARE                  300
VERMONT                   207
ALASKA                    200
SOUTH DAKOTA              189
IOWA                      186
WYOMING                   150
MAINE                     101
NORTH DAKOTA               52
Name: BorrowerState, dtype: int64
#Good, let us move on to the next stage
# structure of the dataframe
df.shape
(113937, 82)
Our dataset now has 113,937 rows and 82 columns (the original 81 plus the extracted Year column). However, we are not going to work with the whole dataset, as our questions do not span every recorded variable. Our questions for this dataset include, but are not limited to:

- How many homeowners borrow in each state?
- Which year has the highest rate of borrowing homeowners?
- Which state has the highest number of borrowing homeowners?
- Is there a correlation between available bank card credit and borrowing capacity?
- What are the frequencies of the categorical variables: term of loan, borrower's employment status, year of loan, and loan status?
- Are there differences between loans depending on the loan term?
Since we are not working with the whole dataset, let us select the columns we need into a smaller dataframe.
# select the required columns into a new dataframe
new_df = df[['LoanOriginationDate','Term', 'Year', 'LoanStatus', 'BorrowerState', 'Occupation',
'EmploymentStatus', 'IsBorrowerHomeowner', 'LoanCurrentDaysDelinquent',
'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanOriginalAmount',
'MonthlyLoanPayment','Recommendations', 'Investors','AvailableBankcardCredit']]
# save as a csv file
new_df.to_csv('new_df.csv', index = False)
#inspect the structure of the new dataframe
new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 16 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   LoanOriginationDate        113937 non-null  datetime64[ns]
 1   Term                       113937 non-null  category
 2   Year                       113937 non-null  category
 3   LoanStatus                 113937 non-null  category
 4   BorrowerState              108422 non-null  object
 5   Occupation                 113937 non-null  category
 6   EmploymentStatus           113937 non-null  category
 7   IsBorrowerHomeowner        113937 non-null  bool
 8   LoanCurrentDaysDelinquent  113937 non-null  int64
 9   IncomeVerifiable           113937 non-null  bool
 10  StatedMonthlyIncome        113937 non-null  int32
 11  LoanOriginalAmount         113937 non-null  int64
 12  MonthlyLoanPayment         113937 non-null  int32
 13  Recommendations            113937 non-null  int64
 14  Investors                  113937 non-null  int64
 15  AvailableBankcardCredit    106393 non-null  float64
dtypes: bool(2), category(5), datetime64[ns](1), float64(1), int32(2), int64(4), object(1)
memory usage: 7.7+ MB
new_df.isnull().sum()
LoanOriginationDate             0
Term                            0
Year                            0
LoanStatus                      0
BorrowerState                5515
Occupation                      0
EmploymentStatus                0
IsBorrowerHomeowner             0
LoanCurrentDaysDelinquent       0
IncomeVerifiable                0
StatedMonthlyIncome             0
LoanOriginalAmount              0
MonthlyLoanPayment              0
Recommendations                 0
Investors                       0
AvailableBankcardCredit      7544
dtype: int64
So our BorrowerState column has 5515 null values again (the earlier `map` call turned the 'Unidentified' rows back into NaN), and AvailableBankcardCredit has 7544 missing values. Let us deal with those before we continue.
new_df.BorrowerState.fillna('Unknown', inplace = True)
C:\Users\mayowa.ebenezer\AppData\Local\Temp\ipykernel_17648\405810248.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df.BorrowerState.fillna('Unknown', inplace = True)
new_df.isnull().sum()
LoanOriginationDate             0
Term                            0
Year                            0
LoanStatus                      0
BorrowerState                   0
Occupation                      0
EmploymentStatus                0
IsBorrowerHomeowner             0
LoanCurrentDaysDelinquent       0
IncomeVerifiable                0
StatedMonthlyIncome             0
LoanOriginalAmount              0
MonthlyLoanPayment              0
Recommendations                 0
Investors                       0
AvailableBankcardCredit      7544
dtype: int64
#Available Bank Card Credit up next
# fill missing credit with numeric 0 (not the string '0', which would
# silently change the column's dtype from float to object)
new_df.AvailableBankcardCredit.fillna(0, inplace = True)
#Check for null values again
new_df.isnull().sum()
C:\Users\mayowa.ebenezer\AppData\Local\Temp\ipykernel_17648\2755186625.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df.AvailableBankcardCredit.fillna(0, inplace = True)
LoanOriginationDate          0
Term                         0
Year                         0
LoanStatus                   0
BorrowerState                0
Occupation                   0
EmploymentStatus             0
IsBorrowerHomeowner          0
LoanCurrentDaysDelinquent    0
IncomeVerifiable             0
StatedMonthlyIncome          0
LoanOriginalAmount           0
MonthlyLoanPayment           0
Recommendations              0
Investors                    0
AvailableBankcardCredit      0
dtype: int64
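The SettingWithCopyWarning appears because `new_df` was built by slicing `df`. A minimal sketch of the usual remedy, on a made-up two-row dataframe: take an explicit `.copy()` when creating the subset, and assign the result of `fillna` instead of mutating in place.

```python
import pandas as pd

df = pd.DataFrame({'BorrowerState': ['CA', None],
                   'AvailableBankcardCredit': [1500.0, None]})

# explicit copy: new_df is its own object, not a view of df
new_df = df[['BorrowerState', 'AvailableBankcardCredit']].copy()
new_df['BorrowerState'] = new_df['BorrowerState'].fillna('Unknown')
new_df['AvailableBankcardCredit'] = new_df['AvailableBankcardCredit'].fillna(0)
print(new_df.isnull().sum().sum())  # 0
```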
# Done, lets move on
# more stat info
new_df.describe()
| LoanCurrentDaysDelinquent | StatedMonthlyIncome | LoanOriginalAmount | MonthlyLoanPayment | Recommendations | Investors | |
|---|---|---|---|---|---|---|
| count | 113937.000000 | 1.139370e+05 | 113937.00000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 152.816539 | 5.607707e+03 | 8337.01385 | 271.978128 | 0.048027 | 80.475228 |
| std | 466.320254 | 7.478491e+03 | 6245.80058 | 192.701916 | 0.332353 | 103.239020 |
| min | 0.000000 | 0.000000e+00 | 1000.00000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 0.000000 | 3.200000e+03 | 4000.00000 | 131.000000 | 0.000000 | 2.000000 |
| 50% | 0.000000 | 4.666000e+03 | 6500.00000 | 217.000000 | 0.000000 | 44.000000 |
| 75% | 0.000000 | 6.825000e+03 | 12000.00000 | 371.000000 | 0.000000 | 115.000000 |
| max | 2704.000000 | 1.750002e+06 | 35000.00000 | 2251.000000 | 39.000000 | 1189.000000 |
#last ten values (tail)
new_df.tail(10)
| LoanOriginationDate | Term | Year | LoanStatus | BorrowerState | Occupation | EmploymentStatus | IsBorrowerHomeowner | LoanCurrentDaysDelinquent | IncomeVerifiable | StatedMonthlyIncome | LoanOriginalAmount | MonthlyLoanPayment | Recommendations | Investors | AvailableBankcardCredit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113927 | 2008-05-09 | Medium Term | 2008 | Completed | WASHINGTON | Executive | Full-time | True | 0 | True | 10333 | 4292 | 132 | 2 | 194 | 34415.0 |
| 113928 | 2011-06-10 | Medium Term | 2011 | Completed | COLORADO | Other | Full-time | False | 0 | True | 2333 | 2000 | 73 | 0 | 25 | 3746.0 |
| 113929 | 2013-07-10 | Medium Term | 2013 | Completed | FLORIDA | Accountant/CPA | Employed | False | 0 | True | 4333 | 2500 | 101 | 0 | 26 | 7537.0 |
| 113930 | 2013-07-10 | Medium Term | 2013 | Current | INDIANA | Professional | Employed | True | 0 | True | 6250 | 3000 | 106 | 0 | 52 | 1322.0 |
| 113931 | 2014-01-22 | Long Term | 2014 | Current | ILLINOIS | Analyst | Employed | False | 0 | True | 8146 | 25000 | 565 | 0 | 1 | 12056.0 |
| 113932 | 2013-04-22 | Medium Term | 2013 | Current | ILLINOIS | Food Service Management | Employed | True | 0 | True | 4333 | 10000 | 364 | 0 | 1 | 1886.0 |
| 113933 | 2011-11-07 | Medium Term | 2011 | FinalPaymentInProgress | PENNSYLVANIA | Professional | Employed | True | 0 | True | 8041 | 2000 | 65 | 0 | 22 | 6658.0 |
| 113934 | 2013-12-23 | Long Term | 2013 | Current | TEXAS | Other | Employed | True | 0 | True | 2875 | 10000 | 273 | 0 | 119 | 7853.0 |
| 113935 | 2011-11-21 | Long Term | 2011 | Completed | GEORGIA | Food Service | Full-time | True | 0 | True | 3875 | 15000 | 449 | 0 | 274 | 4137.0 |
| 113936 | 2014-01-21 | Medium Term | 2014 | Current | NEW YORK | Professor | Employed | False | 0 | True | 4583 | 2000 | 64 | 0 | 1 | 675.0 |
Let us look at the distributions in our data and show some relationships between them. This should be fun!
#Using a histogram
def chart(x, title):
    """Plot a histogram to show the distribution of a numeric variable.

    param: x, title
    return: None
    """
    plt.figure(figsize=(12,10), dpi = 400)
    plt.hist(x = x)
    plt.title(title)
    plt.xlabel('Amount in Dollars', fontsize = 8)
    plt.ylabel('Frequency', fontsize = 8)
chart(new_df.LoanOriginalAmount, 'Histogram Distribution of Loan Original Amount')
The graph above shows that the most common loan amounts lie to the left, meaning the distribution is RIGHT-SKEWED (an asymmetrical distribution). Most borrowers took loans of about 5000 dollars; there are, however, outliers between the 30000 and 35000 dollar range.
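The skew claim can be checked numerically: pandas' `Series.skew()` returns a positive value for right-skewed data. A toy sketch with made-up amounts shaped like the histogram above:

```python
import pandas as pd

# most values small, with a long right tail -- mimics the loan-amount shape
amounts = pd.Series([1000] * 50 + [5000] * 30 + [15000] * 10 + [35000] * 2)
print(amounts.skew() > 0)  # True for a right-skewed distribution
```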
#Monthly loan payment graph
chart(new_df.MonthlyLoanPayment, 'Distribution of Monthly Loan Payment')
The data in the above graph are also right-skewed (an asymmetrical distribution).
Let us explore further: the probability density function of the data points.
def density(x, title):
    """Plot a kernel density estimate.

    param: x, title
    return: None
    """
    plt.figure(figsize=(12,10), dpi = 400)
    sns.kdeplot(x = x, data = new_df, fill = True)
    plt.xlabel('Amount (Dollars)')
    plt.title(title)
density('LoanOriginalAmount', 'Density Estimate for Loan Original Amount')
The probability that a randomly chosen loan original amount will fall between 5000 dollars and 12000 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [5000, 12000].
density('MonthlyLoanPayment', 'Density Estimate for Monthly Loan Payment')
The probability that a randomly chosen monthly loan payment will fall between 300 dollars and 500 dollars can be calculated as the area between the density function (graph) and the x-axis in the interval [300, 500]
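The "probability as area" statement has an empirical counterpart: the fraction of observations falling inside the interval approximates the area under the density curve over that interval. A sketch with synthetic, loan-like amounts (the gamma parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical stand-in for new_df.LoanOriginalAmount
amounts = rng.gamma(shape=2.0, scale=4000.0, size=100_000)

# empirical estimate of P(5000 <= amount <= 12000) -- the same quantity
# the area under the KDE curve represents on that interval
p = np.mean((amounts >= 5000) & (amounts <= 12000))
print(0 < p < 1)  # True
```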
Let us Plot further on our data frame, to check for more relationships
def univariate_plot(x, title):
    """Plot a countplot of a categorical variable, annotating each bar.

    param: x, title
    return: None
    """
    plt.figure(figsize=(8,6), dpi = 400)
    ax = sns.countplot(x = x, data = new_df)
    for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0, p.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation = 90)
    plt.show();
#Loan Term Distribution
univariate_plot('Term','Term of Loan Distribution.')
Loans disbursed on a medium time frame (36 months) have the highest occurrence, followed by those disbursed on the long term. If this means anything, it is that people generally prefer medium- to long-term loans over short-term loans that might destabilize their income.
univariate_plot('EmploymentStatus', "Employment Status Distribution.")
From the above graph, we can infer that employed people are the largest group of borrowers. Does this mean employees are not earning enough to suit their lifestyles, or does the loan company prefer lending to working-class individuals?
univariate_plot('Year', 'Loan Distribution by Year.')
The number of loans disbursed spiked in 2013 before dropping back drastically in 2014. What might have caused this? An economic crisis? Inflation? (Note that the dataset may not cover all of 2014.)
univariate_plot('IsBorrowerHomeowner', "Borrower's Home Owner Distribution.")
The difference between homeowners and non-homeowners among borrowers is not that large, but it was worth checking.
univariate_plot('IncomeVerifiable', "Borrower's Verifiable Income Distribution.")
The loan company lent almost exclusively to borrowers with verifiable income -- a sensible policy, as no one wants to run into a loss.
univariate_plot('LoanStatus', 'Loan Status Distribution.')
Few borrowers cancelled their loans; the majority have current loans, while the second-largest group consists of completed loans.
sns.countplot(x='BorrowerState', data=new_df, order=new_df.BorrowerState.value_counts().iloc[:10].index)
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), [Text(0, 0, 'CALIFORNIA'), Text(1, 0, 'TEXAS'), Text(2, 0, 'NEW YORK'), Text(3, 0, 'FLORIDA'), Text(4, 0, 'ILLINOIS'), Text(5, 0, 'Unknown'), Text(6, 0, 'GEORGIA'), Text(7, 0, 'OHIO'), Text(8, 0, 'MICHIGAN'), Text(9, 0, 'VIRGINIA')])
The plot above shows the ten states with the most borrowers. California is the top borrowing state, followed by Texas.
# bottom ten states by number of borrowers
# (value_counts() sorts descending, so index[-10:] gives the least frequent;
# the slice index[::-10] would instead sample every tenth state)
sns.countplot(x='BorrowerState', data=new_df, order=new_df.BorrowerState.value_counts().index[-10:])
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), [Text(0, 0, 'DISTRICT OF COLUMBIA'), Text(1, 0, 'MONTANA'), Text(2, 0, 'DELAWARE'), Text(3, 0, 'VERMONT'), Text(4, 0, 'ALASKA'), Text(5, 0, 'SOUTH DAKOTA'), Text(6, 0, 'IOWA'), Text(7, 0, 'WYOMING'), Text(8, 0, 'MAINE'), Text(9, 0, 'NORTH DAKOTA')])
From the above chart, North Dakota has the fewest borrowers, followed by Maine and Wyoming.
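As a general pattern (illustrated with toy data): `value_counts()` sorts descending, so the head of its index gives the most frequent categories and the tail gives the least frequent.

```python
import pandas as pd

s = pd.Series(list('aaabbc'))
counts = s.value_counts()       # a: 3, b: 2, c: 1
print(list(counts.index[:2]))   # ['a', 'b'] -- two most frequent
print(list(counts.index[-2:]))  # ['b', 'c'] -- two least frequent
```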
Let us check the occupations of borrowers and see the top and bottom ten.
sns.countplot(x='Occupation', data=new_df, order=new_df.Occupation.value_counts().iloc[:10].index)
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), [Text(0, 0, 'Other'), Text(1, 0, 'Professional'), Text(2, 0, 'Computer Programmer'), Text(3, 0, 'Executive'), Text(4, 0, 'Teacher'), Text(5, 0, 'Administrative Assistant'), Text(6, 0, 'Analyst'), Text(7, 0, 'Unknown'), Text(8, 0, 'Sales - Commission'), Text(9, 0, 'Accountant/CPA')])
The top category is 'Other', so it seems many borrowers did not disclose a specific occupation. Anonymity, perhaps?
# note: index[::-10] samples every tenth category starting from the least
# common, so only the first tick is truly the rarest occupation;
# value_counts().index[-10:] would give the true bottom ten
sns.countplot(x='Occupation', data=new_df, order=new_df.Occupation.value_counts().index[::-10])
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6]), [Text(0, 0, 'Student - Technical School'), Text(1, 0, 'Flight Attendant'), Text(2, 0, 'Investor'), Text(3, 0, 'Fireman'), Text(4, 0, 'Tradesman - Mechanic'), Text(5, 0, 'Laborer'), Text(6, 0, 'Unknown')])
Student - Technical School is the least common borrower occupation.
Extra: for more clarity, let us create a word cloud of the Occupation column.
#import word cloud
from wordcloud import WordCloud
# join all occupation labels into one text string
# (note: multi-word occupations get split into individual words)
occupation_text = " ".join(i for i in new_df['Occupation'])
#Define a function to create the word cloud
def draw_word_cloud(text, title):
    """Plot a word cloud.

    param: text, title
    return: None
    """
    wordcloud = WordCloud(background_color="white", max_font_size=300, width=1024, height=1000, colormap="magma").generate_from_text(text)
    plt.figure(figsize=(20,20))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.margins(x=0, y=0)
    plt.suptitle(title, fontsize=30)
    plt.show();
draw_word_cloud(occupation_text, "Borrower's Occupation Word Cloud.")
From the word cloud above, we can see that, aside from the catch-all 'Other' category, Professional and Computer Programmer are among the most common borrower occupations.
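One caveat of joining occupations with spaces is that multi-word titles like 'Computer Programmer' are split into separate words in the cloud. A hedged alternative sketch: count whole labels with `collections.Counter` and feed the result to `WordCloud.generate_from_frequencies` (the occupation list here is hypothetical):

```python
from collections import Counter

occupations = ['Computer Programmer', 'Professional',
               'Computer Programmer', 'Other']
freqs = Counter(occupations)
# WordCloud(...).generate_from_frequencies(freqs) would now treat each
# multi-word occupation as a single token
print(freqs['Computer Programmer'])  # 2
```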
Let us check for outliers in our data using the percentile distribution. Here, we will use the Loan Original Amount, Monthly Loan Payment, and Stated Monthly Income columns, among others, to ascertain our values.
#using a function for these plots to enable fast and easy analysis
#of our distribution(s)
def univariate_boxplot(x, title):
    """Plot a univariate boxplot.

    param: x, title
    return: None
    """
    plt.figure(figsize=(8,6), dpi = 400)
    sns.boxplot(x = x, data = new_df)
    plt.suptitle(title)
    plt.show();
#for our Loan Original Amount data
univariate_boxplot('LoanOriginalAmount', 'Loan Original Amount Distribution.')
The 25th percentile of the loan original amount falls slightly below 5000 dollars, the median just above 5000 dollars, and the 75th percentile falls short of 15000 dollars.
# for our monthly loan payment distribution
univariate_boxplot('MonthlyLoanPayment', 'Monthly Loan Payment Distribution.')
The 25th percentile, median, and 75th percentile of the monthly loan payment all fall below 500 dollars.
# Investors distribution
univariate_boxplot('Investors', 'Investors Distribution.')
The 25th percentile, median, and 75th percentile of the investors variable all fall below 200 investors.
#Monthly Income
univariate_boxplot('StatedMonthlyIncome', 'Stated Monthly Income Distribution.')
The figure above depicts the distribution of the stated monthly income, with outliers plotted as individual points beyond the whiskers; because of the extreme outliers, the bulk of the data is so compressed that the first quartile and the other percentiles cannot be read from this figure.
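The outliers that these boxplots flag follow the 1.5 × IQR rule: anything above Q3 + 1.5 × (Q3 − Q1) is drawn as an individual point. A small sketch with made-up amounts:

```python
import pandas as pd

s = pd.Series([1000, 4000, 6500, 12000, 35000])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
print(upper_fence)                  # 24000.0
print(s[s > upper_fence].tolist())  # [35000]
```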
To analyse the loans with respect to year and other categories, the following transformations were made:

- The loan origination date column was converted from object to datetime, the year was extracted from it, and the extracted Year column was set to a categorical data type.
- The loan term values were transformed from the original 12, 36, and 60 months to short term, medium term, and long term respectively, and the column was converted to a categorical data type for easier analysis.
- The loan status values representing "past due" over various numbers of days were collapsed into a single 'Past Due' value, regardless of the number of days.
- The borrower state values were expanded from their abbreviations to full state names for easier comprehension by non-native readers.
- The stated monthly income and monthly loan payment variables were converted from float to integer for consistency with the loan amount data type.
- The occupation column was converted from object to a categorical data type.
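The date and term transformations described above can be sketched as follows (the single-row dataframe and its values are hypothetical):

```python
import pandas as pd

loans = pd.DataFrame({'LoanOriginationDate': ['2013-07-10 00:00:00'],
                      'Term': [36]})
# object -> datetime, then extract the year as a categorical column
loans['LoanOriginationDate'] = pd.to_datetime(loans['LoanOriginationDate'])
loans['Year'] = loans['LoanOriginationDate'].dt.year.astype('category')
# map numeric term values to readable labels
term_map = {12: 'Short Term', 36: 'Medium Term', 60: 'Long Term'}
loans['Term'] = loans['Term'].map(term_map).astype('category')
print(loans.loc[0, 'Term'], loans.loc[0, 'Year'])  # Medium Term 2013
```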
For the distribution of term, loans disbursed on the medium term have the highest occurrence. Employed individuals borrowed the most, followed by the full-time category. The year 2013 had the highest number of loan disbursements, followed by 2012 and 2014 in second and third position; the fewest disbursements occurred in 2005, with only 22 loans. There was not much difference between homeowners and non-homeowners, and loans were only disbursed to those whose income was verifiable. The distribution of the loan original amount is right-skewed (asymmetrical): most loan amounts cluster on the left side of the histogram, the peak occurs at about 5000 dollars, and there are outliers in the 32000 to 35000 dollar range. The distribution of the monthly loan payment is likewise right-skewed, with most values clustered on the left and a peak at about 173 dollars.
AIM: To identify how one categorical variable relates to another -- that is, how the values of one categorical variable are spread over the values of another.
# a func to plot a bivariate countplot
def bivariate_count_plot(x, hue, title):
    """Plot a countplot of x, split by hue, annotating each bar.

    param: x, hue, title
    return: None
    """
    plt.figure(figsize=(8,6), dpi = 400)
    ax = sns.countplot(x = x, hue = hue, data = new_df)
    for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation = 90)
    ax.margins(y=0.1)
    # label the x-axis with the plotted column rather than a hardcoded name
    plt.xlabel(x, fontsize = 15)
    plt.tight_layout()
    plt.show();
# plot loan status by whether the borrower is a home owner
bivariate_count_plot('LoanStatus', 'IsBorrowerHomeowner', 'Loan Status by Home Owner status.')
I wanted to check whether a borrower's status as a home owner affects whether the loan gets paid, but found that it does not have much effect. The current loan status, the largest group, shows that home owners are borrowers too.
# plot loan status by term
bivariate_count_plot('LoanStatus', 'Term', 'Loan Status by Term Duration.')
From earlier findings, the medium term has the highest occurrence of loan durations; the figure above shows how that spreads over the loan status categories. For the current loan status, 20127 loans were long term, 36387 medium term, and just 62 short term. The past due category has 684 long-term, 1373 medium-term, and 10 short-term loans. The other categories can be read from the figure above.
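The per-category counts read off the annotated bars can also be computed directly with `pd.crosstab`; a sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({'LoanStatus': ['Current', 'Current', 'Past Due'],
                    'Term': ['Medium Term', 'Long Term', 'Medium Term']})
# rows: loan status, columns: term, cells: counts
table = pd.crosstab(toy['LoanStatus'], toy['Term'])
print(table.loc['Current', 'Medium Term'])  # 1
```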
bivariate_count_plot('IncomeVerifiable', 'IsBorrowerHomeowner', 'Is Homeowner(s) Income Verifiable?')
From the above plot, home owners with verifiable income number 54073, while those without number 3405, meaning the large majority of home owners also have verifiable income.
# plot loan status by home ownership
bivariate_count_plot('LoanStatus', 'IsBorrowerHomeowner', 'Loan Status by Borrower Home Ownership.')
Of the current loans, 30478 borrowers are home owners while 26098 are not. The past due category comprises 1003 home owners and 1064 non-home-owners. The completed category comprises 18280 home owners and 19794 borrowers who do not own a home.
# plot loan status by verifiable income
bivariate_count_plot('LoanStatus', 'IncomeVerifiable', 'Loan Status by Verifiable Income')
Of the current loans, 52434 have verifiable income while 4142 do not. The past due category has 1857 borrowers with verifiable income and 210 without. The completed category has 35292 borrowers with verifiable income while 2792 do not.
Let us evaluate the mean loan original amount and mean stated monthly income, grouped by the categorical variables year, loan status, and verifiable income.
# Function to plot mean Loan Original Amount and
# mean Stated Monthly Income grouped by a feature
def bi_bar_plot(by, groupby, title):
    """Plot a barplot of group means.

    param: by, groupby, title
    return: ax, mean
    """
    # select the two columns before averaging, and compute the means once
    cols = ['LoanOriginalAmount', 'StatedMonthlyIncome']
    mean = new_df.groupby(by = groupby)[cols].mean()
    ax = mean.plot(kind = 'bar', figsize = (26,12), fontsize = 16)
    plt.xticks(rotation = 0)
    plt.title(title, fontsize = 25)
    plt.ylabel('Mean Values', fontsize = 20)
    plt.xlabel(by, fontsize = 20)
    return ax, mean
# Function to group by year
bi_bar_plot('Year', 'Year', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Year.')
LoanOriginalAmount StatedMonthlyIncome
Year
2005 3576.681818 11122.727273
2006 4763.325262 4743.353708
2007 7049.545026 4653.942321
2008 6021.628289 4618.605869
2009 4354.858818 5091.406448
2010 4766.540340 5290.192675
2011 6692.021108 5659.263359
2012 7833.842173 5709.352376
2013 10545.065599 6156.148231
2014 11912.219520 6329.746796
The loan original amount has its highest mean in 2014 and its lowest in 2005; for stated monthly income, 2005 has the highest mean while 2008 has the lowest.
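Rather than reading the extremes off the bar chart by eye, `idxmax`/`idxmin` on the grouped means recover them directly. A sketch with a hypothetical mini-frame in place of `new_df`:

```python
import pandas as pd

# Hypothetical mini-frame; in the notebook this would be new_df.
loans = pd.DataFrame({
    'Year': [2005, 2005, 2014, 2014],
    'LoanOriginalAmount': [3000, 4000, 11000, 12000],
    'StatedMonthlyIncome': [12000, 10000, 6000, 6500],
})

means = loans.groupby('Year')[['LoanOriginalAmount', 'StatedMonthlyIncome']].mean()
# idxmax recovers the category behind each extreme value.
peak_loan_year = means['LoanOriginalAmount'].idxmax()
peak_income_year = means['StatedMonthlyIncome'].idxmax()
```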
# Function to group by loan status
bi_bar_plot('LoanStatus', 'LoanStatus', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Loan Status.')
LoanOriginalAmount StatedMonthlyIncome
LoanStatus
Cancelled 1700.000000 2608.600000
Chargedoff 6398.916694 4485.688793
Completed 6189.093239 5324.206204
Current 10360.835018 6152.938667
Defaulted 6486.798525 4366.463930
FinalPaymentInProgress 8346.121951 6311.419512
Past Due 8258.437349 5366.540397
The loan original amount has its highest mean in the current loan status category and its lowest in the cancelled category; for stated monthly income, final payment in progress has the highest mean while cancelled has the lowest.
# Function to group by verifiable income
bi_bar_plot('IncomeVerifiable', 'IncomeVerifiable', 'Mean Loan Amount and Mean Monthly Income Bar Chart Group By Verifiable Income.')
LoanOriginalAmount StatedMonthlyIncome
IncomeVerifiable
False 7111.896413 5022.028838
True 8437.904368 5655.938519
For stated monthly income, the highest mean belongs to borrowers with a verifiable income and the lowest to those without; the same ordering holds for the loan original amount.
Let's establish whether any correlation exists between the continuous numerical variables: stated monthly income, loan original amount, and monthly loan payment.
# define a function to plot a line plot
def line_plot(x, y, title):
    """Line plot of y against x from new_df.

    param: x, y, title
    return: None
    """
    plt.figure(figsize=(10, 8), dpi=450)
    sns.lineplot(x=x, y=y, data=new_df)
    plt.title(title);
# call func to plot a line plot of stated monthly income against loan original amount
line_plot('StatedMonthlyIncome', 'LoanOriginalAmount', 'Line Plot Depicting Relationship Between Stated Monthly Income and Loan Original Amount.')
The figure above suggests no clear positive correlation between the two variables; the majority of the values fall between 0 and 35000 dollars.
#Function to depict lineplot of monthly loan payment against original loan amount
line_plot('LoanOriginalAmount', 'MonthlyLoanPayment', 'Line Plot Depicting Relationship Between Monthly Loan Payment and Loan Original Amount.')
As the loan original amount increases, the monthly loan payment also increases: larger loans carry proportionally larger monthly payments, which means there is a positive correlation between our two variables.
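The visual trend can be quantified with a Pearson coefficient via `Series.corr`. A sketch using a hypothetical mini-frame in place of `new_df` (the payment values below are made up to mimic the roughly linear relationship):

```python
import pandas as pd

# Hypothetical mini-frame; the full notebook uses new_df.
loans = pd.DataFrame({
    'LoanOriginalAmount': [1000, 5000, 10000, 20000, 35000],
    'MonthlyLoanPayment': [35, 160, 320, 640, 1100],
})

# Series.corr gives the Pearson coefficient directly, putting a
# number on the upward trend seen in the line plot.
r = loans['LoanOriginalAmount'].corr(loans['MonthlyLoanPayment'])
```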
line_plot('Year', 'Investors', 'Line Plot Depicting Relationship Between Years and Investors.')
From 2005 to mid-2008, and again around 2010, there was a sharp increase in the number of investors involved in the loan business, but the numbers dropped between mid-2008 and 2014.
SUMMARY
I wanted to check whether a borrower's status as a homeowner affects whether a loan is paid, but found that it did not have much effect. The current loan status, the largest category, shows that homeowners are borrowers too.
From earlier findings, the medium term is the most common loan duration: of the current loans, 20127 were long term, 36387 medium term, and just 62 short term, while the past due category contains 684 long-term, 1373 medium-term, and 10 short-term loans.
54073 homeowners had a verifiable income while 3405 did not, so the large majority of homeowners also have their income verifiable.
Of the current loans, 30478 borrowers are homeowners while 26098 are not; the past due category includes 1003 homeowners and 1064 non-homeowners; in the completed category, 18280 borrowers own a home while 19794 do not.
The loan original amount has its highest mean in 2014 and its lowest in 2005; for stated monthly income, 2005 has the highest mean while 2008 has the lowest.
The loan original amount has its highest mean in the current loan status category and its lowest in the cancelled category; for stated monthly income, final payment in progress has the highest mean while cancelled has the lowest.
There is no clear positive correlation between stated monthly income and loan original amount.
As the loan original amount increases, the monthly loan payment also increases, indicating a positive correlation between those two variables.
From 2005 to mid-2008, and again around 2010, there was a sharp increase in investors involved in the loan business, but the numbers dropped between mid-2008 and 2014.
AIM: showcase the relationship between three variables: two continuous numerical variables and a categorical variable.
# Function to plot a scatter plot of monthly loan payment
# against loan original amount, grouped by a categorical feature
def scatter(hue, title):
    """Scatter plot of MonthlyLoanPayment against LoanOriginalAmount.

    param: hue, title
    return: None
    """
    plt.figure(figsize=(12, 10), dpi=450)
    sns.scatterplot(x='LoanOriginalAmount', y='MonthlyLoanPayment', hue=hue, data=new_df)
    plt.xlabel('Loan Original Amount (Dollars)')
    plt.ylabel('Monthly Loan Payment (Dollars)')
    plt.title(title);
# Plot scatter plot grouped by loan term
scatter('Term', 'Monthly Loan Payment Against Loan Original Amount Grouped by Loan Term.')
There is a positive relationship between the loan original amount and the monthly loan payment; the data points are spread across the plot, categorized by term of loan.
# Plot scatter plot grouped by year
scatter('Year', 'Monthly Loan Payment Against Loan Original Amount Grouped by Loan Year.')
As found earlier, there is a positive relationship between the loan original amount and the monthly loan payment; here the data points are categorized by year of loan.
# Plot scatter plot grouped by verifiable income
scatter('IncomeVerifiable', 'Monthly Loan Payment Against Loan Original Amount Grouped by Verifiable Income.')
As found earlier, there is a positive relationship between the loan original amount and the monthly loan payment; here the data points are categorized by verifiable income.
# Plot scatter plot grouped by borrower home ownership status
scatter('IsBorrowerHomeowner', 'Monthly Loan Payment Against Loan Original Amount Grouped by Borrower Home Ownership.')
As found earlier, there is a positive relationship between the loan original amount and the monthly loan payment; here the data points are categorized by home ownership.
# Plot scatter plot grouped by employment status
scatter('EmploymentStatus', 'Monthly Loan Payment Against Loan Original Amount Grouped by Employment Status.')
As found earlier, there is a positive relationship between the loan original amount and the monthly loan payment; here the data points are categorized by employment status.
# Plot scatter plot grouped by loan status
scatter('LoanStatus', 'Monthly Loan Payment Against Loan Original Amount Grouped by Loan Status.')
As found earlier, there is a positive relationship between the loan original amount and the monthly loan payment; here the data points are categorized by loan status.
Let us check the linear correlation between the numerical variables by plotting a heatmap of their pairwise correlation coefficients.
plt.figure(figsize=(10,8), dpi = 400)
colormap = sns.color_palette('Greens')
sns.heatmap(new_df.corr(), annot = True, cmap = colormap, center = 0)
plt.title('Correlation Matrix Depicting Relationship Between Variable with Heatmap.')
We can deduce from the heatmap that there is a strong positive correlation between loan original amount and monthly loan payment, with a correlation coefficient of 0.93; stated monthly income and monthly loan payment show little correlation, as do income verifiable and the loan current days delinquent column.
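The most correlated pair can also be pulled programmatically from the correlation matrix instead of read off the annotated heatmap. A sketch, again with a hypothetical mini-frame in place of `new_df` (made-up values chosen so the amount/payment pair dominates, as in the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; in the notebook the matrix comes from new_df.corr().
loans = pd.DataFrame({
    'LoanOriginalAmount': [1000, 5000, 10000, 20000],
    'MonthlyLoanPayment': [35, 160, 330, 640],
    'StatedMonthlyIncome': [4000, 9000, 2500, 7000],
})

corr = loans.corr()
# Mask the trivial diagonal (self-correlation = 1), then unstack and
# take idxmax to find the strongest pairwise link.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))
strongest_pair = off_diag.abs().unstack().idxmax()
```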
From earlier findings, there is a positive relationship between the loan original amount and the monthly loan payment; the data points were spread across scatter plots categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status, each plotted separately to better depict the spread. The heatmap also shows the correlations between the continuous numerical variables: loan original amount and monthly loan payment correlate positively with a coefficient of 0.93, while stated monthly income and loan original amount show little correlation, as do income verifiable and the loan current days delinquent column.
To analyse the loans by year, the loan origination date column was converted from object to datetime, the year was extracted, and the extracted year column was set to a categorical data type. The loan term values were transformed from 12, 36, and 60 months to short term, medium term, and long term respectively, and also converted to a categorical type for easier analysis. The loan status column has several values representing past due loans bucketed by number of days; these were replaced with the single value 'Past Due' regardless of the number of days. The borrower state values were expanded from abbreviations to full names for easier comprehension by non-native researchers. The stated monthly income and monthly loan payment variables were converted from float to integer for consistency with the loan amount data type, and the occupation column was converted from object to categorical.
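The first three of these wrangling steps can be sketched as below. This is a minimal illustration on hypothetical rows, not the notebook's exact cleaning cell; the column names follow the dataset, while the example values and the exact 'Past Due (...)' label format are assumptions:

```python
import pandas as pd

# Hypothetical rows mirroring the raw columns described above.
raw = pd.DataFrame({
    'LoanOriginationDate': ['2013-05-01 00:00:00', '2008-03-15 00:00:00'],
    'Term': [36, 12],
    'LoanStatus': ['Past Due (31-60 days)', 'Completed'],
})

# 1) object -> datetime, then extract the year as a categorical.
raw['LoanOriginationDate'] = pd.to_datetime(raw['LoanOriginationDate'])
raw['Year'] = raw['LoanOriginationDate'].dt.year.astype('category')

# 2) Numeric term -> readable labels, stored as a categorical.
term_map = {12: 'Short Term', 36: 'Medium Term', 60: 'Long Term'}
raw['Term'] = raw['Term'].map(term_map).astype('category')

# 3) Collapse every "Past Due (...)" bucket into a single label.
raw['LoanStatus'] = raw['LoanStatus'].str.replace(r'Past Due.*', 'Past Due', regex=True)
```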
For the distribution of term, loans disbursed on the medium term have the highest occurrence, and employed individuals borrowed the most, followed by the full-time category. 2013 had the highest number of loan disbursements, followed by 2012 and 2014 in second and third position; the fewest disbursements occurred in 2005, with 22 loans. There was not much difference between homeowners and non-homeowners, and loans were disbursed mostly to borrowers whose income was verifiable. The distribution of the loan original amount is right-skewed, i.e. asymmetrical: most values are clustered on the left side of the histogram, the peak occurs at about 5000 dollars, and there are outliers in the range of 32000 to 35000 dollars. The distribution of the monthly loan payment is likewise right-skewed, with most values clustered on the left and a peak at about 173 dollars.
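The right-skew claim can be checked numerically: a positive sample skewness confirms the long right tail seen in a histogram. A sketch with a hypothetical right-skewed sample standing in for `LoanOriginalAmount`:

```python
import pandas as pd

# Hypothetical right-skewed sample standing in for LoanOriginalAmount:
# most values cluster low, with a few large outliers on the right.
amounts = pd.Series([2000, 3000, 4000, 4000, 5000, 5000, 6000, 15000, 35000])

# Series.skew() returns the sample skewness; > 0 means right-skewed.
skewness = amounts.skew()
```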
I wanted to check whether a borrower's status as a homeowner affects whether a loan is paid, but found that it did not have much effect; the current loan status, the largest category, shows that homeowners are borrowers too. From earlier findings, the medium term is the most common loan duration: of the current loans, 20127 were long term, 36387 medium term, and just 62 short term, while the past due category contains 684 long-term, 1373 medium-term, and 10 short-term loans. 54073 homeowners had a verifiable income while 3405 did not, so the large majority of homeowners also have their income verifiable. Of the current loans, 30478 borrowers are homeowners while 26098 are not; the past due category includes 1003 homeowners and 1064 non-homeowners; in the completed category, 18280 borrowers own a home while 19794 do not. The loan original amount has its highest mean in 2014 and its lowest in 2005; for stated monthly income, 2005 has the highest mean while 2008 has the lowest. The loan original amount has its highest mean in the current loan status category and its lowest in the cancelled category; for stated monthly income, final payment in progress has the highest mean while cancelled has the lowest. There is no clear positive correlation between stated monthly income and loan original amount, but as the loan original amount increases the monthly loan payment also increases, indicating a positive correlation between those two variables.
From 2005 to mid-2008, and again around 2010, there was a sharp increase in investors involved in the loan business, but the numbers dropped between mid-2008 and 2014.
Lastly, from earlier findings there was a positive relationship between the loan original amount and the monthly loan payment; the data points were spread across scatter plots categorized by term of loan, year, verifiable income, home ownership, loan status, and employment status, each plotted separately to better depict the spread. The heatmap confirmed a positive correlation between loan original amount and monthly loan payment with a coefficient of 0.93, while stated monthly income and loan original amount showed little correlation, as did income verifiable and the loan current days delinquent column. I crosschecked this in the multivariate exploration to make sure it was correct.